OS Primitives in Python
The Incident: A Service That Ignored SIGTERM
In 2021, a fintech startup deployed a new version of their Python payment processor. During a rolling restart, Kubernetes sent SIGTERM to the old pods and waited 30 seconds before sending SIGKILL. The Python service had no SIGTERM handler registered, so it never began a graceful shutdown. The result: the old pods kept handling requests for 30 seconds and were then force-killed mid-request. In-flight transactions were corrupted. Python does not install a SIGTERM handler by default, so the OS default action applies: the process is terminated on the spot, with no SystemExit raised, no finally blocks or atexit handlers run, no write buffers flushed, and no database connections closed. When the interpreter runs as PID 1 in a container, the kernel skips that default action entirely, so an unhandled SIGTERM is silently ignored until SIGKILL arrives.
The fix is 12 lines of Python. But you need to understand what signals are, how the OS delivers them, and what constraints Python imposes on signal handlers.
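To make the default behavior concrete, here is a small sketch that starts a throwaway child interpreter, sends it SIGTERM, and checks whether any Python-level cleanup ran. The names CHILD and sigterm_without_handler are illustrative only, and the child resets SIGTERM to SIG_DFL explicitly so the result does not depend on what the parent environment passed down.

```python
import signal
import subprocess
import sys

# Script run in a child interpreter: no Python-level SIGTERM handler is installed.
CHILD = r"""
import atexit, signal, time
signal.signal(signal.SIGTERM, signal.SIG_DFL)   # start from the OS default action
atexit.register(lambda: print("atexit ran", flush=True))
print("ready", flush=True)
try:
    time.sleep(30)
finally:
    print("finally ran", flush=True)
"""

def sigterm_without_handler():
    """Send SIGTERM to a child with no handler; return (returncode, remaining output)."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.PIPE,
        text=True,
    )
    # Wait until the child is up before signalling it
    assert proc.stdout.readline().strip() == "ready"
    proc.send_signal(signal.SIGTERM)
    leftover, _ = proc.communicate(timeout=10)
    return proc.returncode, leftover

returncode, leftover = sigterm_without_handler()
# A negative return code means "killed by that signal number"
print(f"returncode={returncode}, cleanup output={leftover!r}")
```

If neither the atexit message nor the finally message shows up in the captured output, the process was terminated without any Python cleanup, which is the failure mode behind the incident.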
The os Module: Python's POSIX Facade
The os module is a thin Python wrapper around the POSIX C API. Most os.* calls map directly onto a single system call (or its libc wrapper) with minimal overhead. When you call os.getpid(), the interpreter invokes getpid(2) and returns the integer; CPython adds no caching layer of its own.
Process Identity
import os
# Current process identifier - unique on this machine right now
pid = os.getpid()
# Parent process identifier - who started us
ppid = os.getppid()
# Real user ID - who owns this process
uid = os.getuid()
# Effective user ID - what permissions we have right now
# (may differ from uid after setuid operations)
euid = os.geteuid()
# Real group ID
gid = os.getgid()
# Effective group ID
egid = os.getegid()
print(f"PID: {pid}, PPID: {ppid}")
print(f"UID: {uid}, EUID: {euid}")
# Look up the account for this UID - consults the passwd database (/etc/passwd or NSS)
# Raises KeyError when the UID has no entry, which is common in minimal containers
import pwd
user_info = pwd.getpwuid(uid)
print(f"Username: {user_info.pw_name}, Home: {user_info.pw_dir}")
Working Directory and Environment
import os
# Current working directory - the kernel tracks this per-process
cwd = os.getcwd()
print(f"CWD: {cwd}")
# Change it - affects all subsequent relative path resolution
os.chdir("/tmp")
print(os.getcwd()) # /tmp
# The environment is a dict-like object backed by C's environ[]
print(os.environ.get("HOME"))
print(os.environ.get("PATH"))
# Set an environment variable - visible to this process and
# all children forked/spawned after this point
os.environ["MY_SERVICE_PORT"] = "8080"
# A safe read with a default
port = int(os.environ.get("SERVICE_PORT", "8000"))
# Check path properties without raising exceptions
print(os.path.exists("/etc/hosts")) # True
print(os.path.isfile("/etc/hosts")) # True
print(os.path.isdir("/etc")) # True
# Expand ~ to home directory
home_config = os.path.expanduser("~/.config/app/settings.json")
Inspecting System Resources
import os
# Number of logical CPU cores (includes hyperthreading)
cpu_count = os.cpu_count()
print(f"CPUs: {cpu_count}")
# POSIX system constants via sysconf
page_size = os.sysconf("SC_PAGE_SIZE") # typically 4096 bytes
max_files = os.sysconf("SC_OPEN_MAX") # per-process FD limit (the RLIMIT_NOFILE soft limit)
# Process CPU time usage
times = os.times()
print(f"User CPU time: {times.user:.3f}s")
print(f"System CPU time: {times.system:.3f}s")
print(f"Elapsed real time (since an arbitrary fixed point): {times.elapsed:.3f}s")
# Disk usage for a filesystem
stat = os.statvfs("/")
total_bytes = stat.f_blocks * stat.f_frsize
free_bytes = stat.f_bfree * stat.f_frsize
used_bytes = total_bytes - free_bytes
print(f"Disk: {used_bytes//(1024**3)} GB used / {total_bytes//(1024**3)} GB total")
Process Creation: os.fork() Internals
fork() is the POSIX primitive for creating a new process. The calling process is duplicated. The child is an exact copy of the parent at the moment of the fork call, and both processes continue executing from the next instruction.
The return value distinguishes parent from child:
- In the parent: fork() returns the child's PID (a positive integer)
- In the child: fork() returns 0
- On error: fork() returns -1 (raises OSError in Python)
import os
import time

print(f"Before fork: PID={os.getpid()}")
pid = os.fork()
if pid == 0:
    # This code runs only in the child process
    print(f"Child: PID={os.getpid()}, PPID={os.getppid()}")
    time.sleep(0.5)
    print("Child exiting")
    os._exit(0)  # MUST use os._exit, NOT sys.exit() - see explanation below
else:
    # This code runs only in the parent process
    print(f"Parent: PID={os.getpid()}, child PID={pid}")
    # Wait for the child to avoid creating a zombie
    child_pid, status = os.waitpid(pid, 0)
    print(f"Child {child_pid} exited with code {os.WEXITSTATUS(status)}")
Why os._exit() instead of sys.exit() in the child?
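sys.exit() raises SystemExit, triggering Python's normal shutdown: atexit handlers run, finalizers (__del__ methods) are called, file buffers are flushed. In a forked child, this is dangerous:
- The child holds its own copy of every unflushed user-space buffer and shares the parent's open file descriptors. Flushing a buffered file in the child writes data that the parent will write again when it flushes its own copy.
- atexit handlers may send shutdown messages over sockets shared with the parent (for example, a database driver's disconnect packet), breaking connections that the parent is still using.
- __del__ methods may repeat external side effects, such as deleting temporary files or releasing locks that the parent still relies on. (The child's memory is a private copy, so it cannot corrupt the parent's reference counts.)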
os._exit() calls the _exit(2) syscall directly - the kernel immediately terminates the process. No Python cleanup, no buffer flushes, no atexit handlers. Always use it in forked children.
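The duplicate-write hazard can be reproduced with a short experiment. The sketch below is only illustrative (SCRIPT and count_lines are made-up names): it runs a tiny fork program in a separate interpreter so that sys.exit() triggers a real interpreter shutdown, then counts how many copies of a buffered line reached the file.

```python
import os
import subprocess
import sys
import tempfile

# Program run in a separate interpreter: buffer one line, fork, exit the child
# either via sys.exit() (normal shutdown) or os._exit() (no cleanup).
SCRIPT = r"""
import os, sys
path, mode = sys.argv[1], sys.argv[2]
f = open(path, "w")
f.write("pending line\n")      # sits in the user-space buffer, not yet on disk
pid = os.fork()
if pid == 0:
    if mode == "os_exit":
        os._exit(0)            # child dies without flushing its copy of the buffer
    sys.exit(0)                # child runs normal shutdown and flushes its copy
os.waitpid(pid, 0)
f.close()                      # parent flushes its own copy
"""

def count_lines(mode: str) -> int:
    """Run SCRIPT with the given exit mode and count lines written to the file."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        subprocess.run([sys.executable, "-c", SCRIPT, path, mode], check=True)
        with open(path) as fh:
            return len(fh.readlines())
    finally:
        os.unlink(path)

lines_sys_exit = count_lines("sys_exit")
lines_os_exit = count_lines("os_exit")
print(f"sys.exit() in child: {lines_sys_exit} line(s); os._exit(): {lines_os_exit} line(s)")
```

On CPython the sys.exit() variant is expected to leave the line in the file twice, once from each process, while the os._exit() variant writes it once.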
Copy-on-Write (COW) Semantics
The kernel does not physically copy all memory pages at fork time. Instead it uses copy-on-write: both parent and child initially share the same physical memory pages, marked read-only. The first write to any page triggers a page fault; the kernel then copies that single page and maps the private copy into the writing process's address space. Reads are free. Only writes pay the copy cost.
After fork():
Parent address space Child address space
┌──────────────────┐ ┌──────────────────┐
│ text (code) │─────────│ text (code) │ shared (read-only, never COW'd)
├──────────────────┤ ├──────────────────┤
│ data (globals) │─────────│ data (globals) │ shared until written
├──────────────────┤ ├──────────────────┤
│ heap (PyObjs) │─────────│ heap (PyObjs) │ shared until written
├──────────────────┤ ├──────────────────┤
│ stack │ │ stack │ COW too, but written almost immediately
└──────────────────┘ └──────────────────┘
│
│ first write by either process triggers:
▼
kernel faults in ──► copies the page ──► maps private copy to writer
This is why fork-based multiprocessing in Python is memory-expensive even for workloads that only read shared data: CPython uses reference counting, so taking a new reference to an object writes to that object's ob_refcnt field. Merely iterating over a list increments the refcount of each element, causing a COW fault on every page that holds one of those objects. A Python process with 2 GB of heap that forks once and then touches most of its existing objects in the child may eventually use close to 4 GB of physical RAM.
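The claim that reads cause writes can be checked directly with sys.getrefcount(). This is a minimal sketch on a standard CPython build; the variable names are illustrative, and the list of plain objects stands in for data loaded before a fork.

```python
import sys

# Stand-in for a data set that was loaded before fork()
records = [object() for _ in range(3)]

before = sys.getrefcount(records[0])
snapshot = list(records)          # a pure "read": only copies references
after = sys.getrefcount(records[0])

# The object itself was never modified by our code, yet its header was written:
# the new list holds one more reference to it.
print(f"refcount before={before}, after={after}")
```

Every such increment is a store into the object's header, which is exactly the kind of write that forces the kernel to copy the containing page in a forked child.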
What Is Inherited Across fork()
Inherited by child (exact copies at fork time):
- Open file descriptors (same FD numbers, share kernel file description)
- Signal dispositions and signal masks
- Memory mappings (COW for private, truly shared for MAP_SHARED)
- Environment variables (os.environ)
- Current working directory
- User/group IDs (real and effective)
- Resource limits (ulimit values)
- CPU affinity mask
NOT inherited / reset in child:
- PID (child gets a new PID)
- PPID (child's PPID = parent's PID)
- Signal pending set (cleared in child)
- fcntl() record locks (not inherited; flock() locks stay attached to the shared open file description)
- Timers (interval timers and pending alarms are cleared)
- Threads (CRITICAL: only the forking thread runs in child)
The last point deserves emphasis. If the parent is multi-threaded and another thread holds a malloc lock, a logging lock, or a threading.Lock at the precise moment of fork(), the child starts with that lock permanently held by a thread that does not exist in the child. Any attempt to acquire it in the child deadlocks. This is why os.fork() in multi-threaded Python programs is treacherous (Python 3.12+ emits a DeprecationWarning when it detects this), and why multiprocessing defaults to spawn on macOS (since Python 3.8) and on Windows, where fork is not available at all. From Python 3.14, the default on other POSIX platforms is forkserver rather than fork.
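Rather than relying on the platform default, a program can ask for a fork-free start method explicitly. The helper below is a hypothetical sketch (pick_context is not a standard function); it only selects a context and does not start any worker.

```python
import multiprocessing as mp

def pick_context(preferred=("forkserver", "spawn")):
    """Return a multiprocessing context that does not fork a threaded parent."""
    available = mp.get_all_start_methods()
    for method in preferred:
        if method in available:
            # get_context() returns an object with the same API as the
            # multiprocessing module, bound to this start method
            return mp.get_context(method)
    raise RuntimeError(f"none of {preferred} is available; got {available}")

ctx = pick_context()
print(f"Using start method: {ctx.get_start_method()}")
# Workers would then be created with ctx.Process(...) or ctx.Pool(...),
# guarded by `if __name__ == "__main__":` as spawn and forkserver require.
```

Using a context object instead of multiprocessing.set_start_method() keeps the choice local, so libraries in the same process are not affected.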
The os.exec* Family: Replacing the Process Image
exec() replaces the current process image with a new program. After a successful exec(), nothing from the calling Python program remains - the interpreter is gone, replaced by the new executable. exec() never returns on success.
import os
# NOTE: each call below replaces the current process and never returns on
# success, so only the first one would actually run. Treat them as alternatives.
# execv(path, args): explicit absolute path, args as a list
# The first element of args is conventionally the program name
os.execv("/bin/ls", ["/bin/ls", "-la", "/tmp"])
# execvp(file, args): searches PATH to find `file`
os.execvp("ls", ["ls", "-la", "/tmp"])
# execve(path, args, env): explicit path + explicit environment dict
# The new program will see ONLY the environment variables you provide
os.execve(
    "/usr/bin/python3",
    ["python3", "-c", "import os; print(os.environ)"],
    {"PATH": "/usr/bin:/bin", "HOME": "/root"},
)
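The classic fork + exec pattern creates a child process running a different program. This is what shells do, and it is conceptually what subprocess.Popen does on POSIX (CPython performs the fork and exec in a C helper and may use vfork() or posix_spawn() where it can):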
import os

pid = os.fork()
if pid == 0:
    # Child: replace with a new program
    # If exec fails (program not found, permission denied),
    # fall through and exit immediately
    try:
        os.execvp("python3", ["python3", "-c", "print('hello from exec')"])
    except OSError as e:
        print(f"exec failed: {e}", flush=True)
    finally:
        os._exit(127)  # 127 is the convention for "command not found"
else:
    # Parent: wait for child to finish
    _, status = os.waitpid(pid, 0)
    print(f"Child exited with code {os.WEXITSTATUS(status)}")
subprocess Internals: fork() + exec() Under the Hood
subprocess.Popen is fork() + exec() with a lot of careful bookkeeping. Understanding this helps you debug subprocess issues and use it correctly.
import subprocess
# This is roughly what subprocess.Popen does internally:
# 1. Call os.fork()
# 2. In the child:
#    a. Set up pipe redirections for stdin/stdout/stderr
#    b. If close_fds=True: close all FDs > 2
#    c. Call os.execvpe(args[0], args, env)
# 3. In the parent:
#    a. Record the child PID
#    b. Return the Popen object
result = subprocess.run(
    ["ls", "-la", "/tmp"],
    capture_output=True,
    text=True,
    timeout=10,
)
print(result.stdout)
print(f"Return code: {result.returncode}")
close_fds=True (the default on Unix since Python 3.2) closes all file descriptors above 2 in the child before calling exec(), except any listed in pass_fds. This prevents FDs from leaking into subprocesses:
import fcntl
import socket
import subprocess

# Hypothetical database connection held by the parent
sock = socket.socket()
sock.connect(("db.internal", 5432))
# Since Python 3.4 (PEP 446), descriptors created by Python are non-inheritable
# by default, so this socket would not survive exec() in any case. Leaks come
# from FDs that were made inheritable or opened by C extensions without O_CLOEXEC.
# close_fds=True: all FDs above 2 are closed before exec, whatever their flags
proc = subprocess.Popen(
    ["ls", "/tmp"],
    close_fds=True,  # default True on Unix; be explicit for clarity
    stdout=subprocess.PIPE,
)
stdout, _ = proc.communicate()  # reads all output and waits for the child to exit
# Alternative: set FD_CLOEXEC on an individual FD
# This closes the FD automatically on exec() in the child
# (sock.set_inheritable(False) is the higher-level equivalent)
flags = fcntl.fcntl(sock.fileno(), fcntl.F_GETFD)
fcntl.fcntl(sock.fileno(), fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
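The FD_CLOEXEC flag can also be inspected and toggled through os.get_inheritable() and os.set_inheritable(). The sketch below uses a throwaway pipe (the variable names are illustrative) to show the state of a freshly created descriptor on Python 3.4 or later.

```python
import fcntl
import os

read_fd, write_fd = os.pipe()
try:
    # Descriptors created by Python start out non-inheritable (PEP 446)
    default_inheritable = os.get_inheritable(read_fd)
    cloexec_by_default = bool(fcntl.fcntl(read_fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)

    # Opting in to inheritance clears FD_CLOEXEC on that one descriptor
    os.set_inheritable(read_fd, True)
    cloexec_after_opt_in = bool(fcntl.fcntl(read_fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)
finally:
    os.close(read_fd)
    os.close(write_fd)

print(f"inheritable by default: {default_inheritable}")
print(f"FD_CLOEXEC by default: {cloexec_by_default}, after opt-in: {cloexec_after_opt_in}")
```

An inheritable descriptor still only reaches a subprocess.Popen child if close_fds=False is passed or the descriptor is listed in pass_fds.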
Signal Handling
Signals are asynchronous notifications delivered to a process by the kernel or by another process. They interrupt the process at an arbitrary point during execution - between any two Python bytecode instructions.
Signal Reference
Signal    Default     Meaning
─────────────────────────────────────────────────────────────
SIGTERM   Terminate   Polite shutdown request (Kubernetes, kill)
SIGINT    Terminate   Keyboard Ctrl-C (Python raises KeyboardInterrupt instead)
SIGHUP    Terminate   Terminal closed; conventionally: reload config
SIGKILL   Terminate   Immediate kill - CANNOT be caught or ignored
SIGCHLD   Ignore      Child process changed state
SIGUSR1   Terminate   User-defined (e.g., dump stats to log)
SIGUSR2   Terminate   User-defined (e.g., toggle debug mode)
SIGPIPE   Terminate   Write to closed pipe or socket (Python ignores it at startup)
SIGALRM   Terminate   Timer expired (from alarm(2))
SIGWINCH  Ignore      Terminal window resized
Registering Signal Handlers
import signal
import os
import time

# Global flags - the cleanest approach for signal-to-mainloop communication
_shutdown_requested = False
_reload_config = False

def handle_shutdown(signum, frame):
    """
    Signal handler: runs on the main thread, between bytecodes.
    Keep it short; setting a simple boolean is always safe.
    """
    global _shutdown_requested
    # Do NOT: call logging.info, socket.send, sys.exit
    # Do: set a flag and return immediately
    _shutdown_requested = True

def handle_sighup(signum, frame):
    """Reload configuration on SIGHUP - common pattern for daemons."""
    global _reload_config
    _reload_config = True

signal.signal(signal.SIGTERM, handle_shutdown)
signal.signal(signal.SIGINT, handle_shutdown)  # Ctrl-C
signal.signal(signal.SIGHUP, handle_sighup)
# Ignore SIGPIPE - write(2) returns EPIPE instead of killing the process
# (CPython already does this at startup; being explicit documents the intent)
signal.signal(signal.SIGPIPE, signal.SIG_IGN)
# Restore OS default for a signal
signal.signal(signal.SIGUSR1, signal.SIG_DFL)

print(f"Server PID: {os.getpid()}")
print(f"Test: kill -TERM {os.getpid()}")
while not _shutdown_requested:
    if _reload_config:
        print("Reloading config...")
        _reload_config = False
    time.sleep(0.1)
print("Graceful shutdown initiated.")
Async-Safety Constraints
Signal handlers are called asynchronously between Python bytecodes. However, CPython does NOT call the Python handler from the C signal handler directly. Instead:
- The OS delivers a signal and invokes CPython's C-level handler
- The C-level handler sets a _Py_atomic_int flag that the eval loop checks and, if a wakeup file descriptor was registered with signal.set_wakeup_fd() (as asyncio does for its self-pipe), writes the signal number to it
- The eval loop checks the flag between bytecodes and dispatches the Python handler on the main thread
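This means Python signal handlers are always called from the main thread, between bytecode instructions - a far safer model than raw C. Still, the handler can interrupt the main thread in the middle of almost any Python-level operation, so you should avoid:
- logging.* calls (the handler may interrupt a log write already in progress; buffered streams are not reentrant and can raise RuntimeError)
- socket.* calls (blocking I/O stalls the main thread, and the interrupted code may be halfway through using the same socket)
- Large object allocation (may trigger a garbage collection pass inside the handler)
- sys.exit() (raises SystemExit at an arbitrary point in the main thread, possibly in the middle of a finally block or a context manager's __exit__)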
Safe operations: flag = True, event.set(), list.append(1), writing a single byte to a pipe.
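The two-stage delivery described above (C-level handler first, Python handler later) can be observed with signal.set_wakeup_fd(). The following is a minimal sketch that must run on the main thread; the pipe, the on_usr1 handler, and the seen list are illustrative names.

```python
import os
import signal
import time

read_fd, write_fd = os.pipe()
os.set_blocking(read_fd, False)
os.set_blocking(write_fd, False)   # set_wakeup_fd() requires a non-blocking fd

seen = []

def on_usr1(signum, frame):
    # Python-level handler: dispatched by the eval loop on the main thread
    seen.append(signum)

previous_handler = signal.signal(signal.SIGUSR1, on_usr1)
previous_fd = signal.set_wakeup_fd(write_fd)
try:
    os.kill(os.getpid(), signal.SIGUSR1)
    # Give the eval loop a chance to dispatch the Python handler
    for _ in range(100):
        if seen:
            break
        time.sleep(0.01)
    # The C-level handler wrote the signal number into the pipe as one byte
    wakeup_bytes = os.read(read_fd, 16)
finally:
    signal.set_wakeup_fd(previous_fd)
    signal.signal(signal.SIGUSR1, previous_handler)
    os.close(read_fd)
    os.close(write_fd)

print(f"Python handler saw: {seen}, wakeup pipe carried: {list(wakeup_bytes)}")
```

An event loop that polls read_fd alongside its sockets wakes up as soon as a signal arrives, which is the self-pipe idea that asyncio builds on.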
Signal Handling in asyncio: loop.add_signal_handler()
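signal.signal() is a poor fit for asyncio programs: the Python-level handler runs between bytecodes, outside the event loop's callback scheduling, so it is not safe to touch loop state or asyncio objects from it. loop.add_signal_handler() (Unix only) instead routes the signal through the event loop's self-pipe, so the callback runs as an ordinary event loop callback.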
import asyncio
import signal
import os

async def main():
    loop = asyncio.get_running_loop()
    shutdown_event = asyncio.Event()

    def _on_shutdown(signum):
        print(f"\nSignal {signal.Signals(signum).name} received.")
        # This callback already runs inside the event loop,
        # so the event can be set directly
        shutdown_event.set()

    # Registers handler with the event loop's internal self-pipe mechanism
    # The callback runs in the event loop thread, integrated with async
    loop.add_signal_handler(signal.SIGTERM, _on_shutdown, signal.SIGTERM)
    loop.add_signal_handler(signal.SIGINT, _on_shutdown, signal.SIGINT)
    print(f"Running, PID={os.getpid()}")
    print(f"Test graceful shutdown: kill -TERM {os.getpid()}")
    await shutdown_event.wait()
    print("Shutdown event set. Cleaning up...")

asyncio.run(main())
Production SIGTERM Graceful Shutdown Pattern
This pattern is similar in spirit to what uvicorn, Gunicorn, and most production Python servers do. The key insight: do not exit immediately on SIGTERM. Stop accepting new connections first, then drain in-flight requests (with a timeout) before shutting down.
import asyncio
import signal
import logging
import os
from typing import Set

logger = logging.getLogger(__name__)

class GracefulServer:
    def __init__(self, shutdown_timeout: float = 30.0):
        self.shutdown_event = asyncio.Event()
        self.shutdown_timeout = shutdown_timeout
        self._active_tasks: Set[asyncio.Task] = set()

    async def handle_request(self, reader: asyncio.StreamReader,
                             writer: asyncio.StreamWriter) -> None:
        peer = writer.get_extra_info("peername")
        task = asyncio.current_task()
        self._active_tasks.add(task)
        try:
            data = await asyncio.wait_for(reader.read(8192), timeout=5.0)
            if data:
                response = (
                    b"HTTP/1.1 200 OK\r\n"
                    b"Content-Length: 2\r\n"
                    b"Connection: close\r\n"
                    b"\r\nOK"
                )
                writer.write(response)
                await writer.drain()
        except asyncio.TimeoutError:
            logger.warning(f"Request timeout from {peer}")
        except Exception as e:
            logger.error(f"Request error from {peer}: {e}")
        finally:
            writer.close()
            try:
                await writer.wait_closed()
            except Exception:
                pass
            self._active_tasks.discard(task)

    async def serve(self, host: str = "0.0.0.0", port: int = 8080) -> None:
        loop = asyncio.get_running_loop()

        def _request_shutdown(signum: int) -> None:
            # Runs as a normal event loop callback, so logging and
            # Event.set() are safe here
            signame = signal.Signals(signum).name
            logger.info(f"Received {signame} - initiating graceful shutdown")
            self.shutdown_event.set()

        loop.add_signal_handler(signal.SIGTERM, _request_shutdown, signal.SIGTERM)
        loop.add_signal_handler(signal.SIGINT, _request_shutdown, signal.SIGINT)
        server = await asyncio.start_server(self.handle_request, host, port)
        logger.info(f"Listening on {host}:{port} - PID {os.getpid()}")
        async with server:
            # Phase 1: Run until shutdown signal
            await self.shutdown_event.wait()
            logger.info("Stopping acceptance of new connections...")
            server.close()  # stops listening; existing connections stay open
            # Phase 2: Wait for active requests to complete
            if self._active_tasks:
                logger.info(
                    f"Draining {len(self._active_tasks)} in-flight requests "
                    f"(timeout: {self.shutdown_timeout}s)..."
                )
                try:
                    await asyncio.wait_for(
                        asyncio.gather(*self._active_tasks, return_exceptions=True),
                        timeout=self.shutdown_timeout,
                    )
                except asyncio.TimeoutError:
                    logger.warning(
                        f"Drain timeout after {self.shutdown_timeout}s - "
                        f"forcing shutdown with {len(self._active_tasks)} tasks remaining"
                    )
                    for task in list(self._active_tasks):
                        task.cancel()
            # On Python 3.12+ wait_closed() also waits for open connections,
            # so it has to come after the drain rather than before it
            await server.wait_closed()
        logger.info("Server shut down cleanly.")

if __name__ == "__main__":
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )
    server = GracefulServer(shutdown_timeout=30.0)
    asyncio.run(server.serve())
SIGCHLD and Zombie Processes
When a child process exits, it enters the zombie state (Z in ps). The kernel keeps the child's PID and exit status in the process table until the parent calls wait(). If the parent never calls wait(), the zombie persists until the parent exits.
import os
import signal
import time

def reap_children(signum: int, frame) -> None:
    """
    SIGCHLD handler: collect all finished children without blocking.
    Uses WNOHANG so we don't block if a child is still running.
    """
    while True:
        try:
            # -1 means: wait for any child process
            # WNOHANG: return (0, 0) immediately if no child has exited
            pid, status = os.waitpid(-1, os.WNOHANG)
            if pid == 0:
                break  # no more children ready to be reaped
            if os.WIFEXITED(status):
                code = os.WEXITSTATUS(status)
                print(f"Reaped child PID={pid}, exit_code={code}")
            elif os.WIFSIGNALED(status):
                sig = os.WTERMSIG(status)
                print(f"Reaped child PID={pid}, killed by signal {sig}")
        except ChildProcessError:
            break  # no children exist at all

signal.signal(signal.SIGCHLD, reap_children)
# Spawn five children with staggered exit times
for i in range(5):
    pid = os.fork()
    if pid == 0:
        time.sleep(0.1 * (i + 1))
        print(f"Child {i} (PID {os.getpid()}) exiting normally")
        os._exit(i)
print("Parent running. Zombies would show as 'Z' in ps aux...")
time.sleep(3)
print("Done. All children reaped by SIGCHLD handler.")
On Linux, setting SIGCHLD to SIG_IGN instructs the kernel to auto-reap children without delivering SIGCHLD:
import signal
# Linux-specific: kernel auto-reaps all children
# Portable alternative: SIGCHLD handler with waitpid(-1, WNOHANG)
signal.signal(signal.SIGCHLD, signal.SIG_IGN)
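One side effect of auto-reaping is that the parent can no longer collect an exit status. The sketch below (Linux behavior assumed; the result variable is illustrative) forks a child while SIGCHLD is ignored and shows what waitpid() reports.

```python
import os
import signal

# Temporarily ignore SIGCHLD so the kernel reaps children on our behalf
previous = signal.signal(signal.SIGCHLD, signal.SIG_IGN)
try:
    pid = os.fork()
    if pid == 0:
        os._exit(0)            # child: exit immediately, skipping Python cleanup
    try:
        # Blocks until the child is gone, then fails: there is nothing to collect
        os.waitpid(pid, 0)
        result = "reaped by parent"
    except ChildProcessError:
        result = "auto-reaped by kernel"
finally:
    signal.signal(signal.SIGCHLD, previous)

print(result)
```

If the parent needs exit codes, for example to restart crashed workers, a SIGCHLD handler with waitpid(-1, WNOHANG) is the better fit.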
The resource Module: Process Limits
resource.getrlimit and resource.setrlimit control per-process resource limits. These are enforced by the kernel and cannot be raised above the hard limit without root privileges (or CAP_SYS_RESOURCE).
import resource

# Print all resource limits
LIMITS = {
    "RLIMIT_NOFILE": resource.RLIMIT_NOFILE,  # max open file descriptors
    "RLIMIT_AS": resource.RLIMIT_AS,          # max virtual address space (bytes)
    "RLIMIT_CPU": resource.RLIMIT_CPU,        # max CPU time (seconds)
    "RLIMIT_DATA": resource.RLIMIT_DATA,      # max heap size (bytes)
    "RLIMIT_STACK": resource.RLIMIT_STACK,    # max stack size (bytes)
    "RLIMIT_NPROC": resource.RLIMIT_NPROC,    # max processes for this user
}
INF = resource.RLIM_INFINITY
for name, const in LIMITS.items():
    soft, hard = resource.getrlimit(const)
    s = "unlimited" if soft == INF else str(soft)
    h = "unlimited" if hard == INF else str(hard)
    print(f" {name:<20} soft={s:>10} hard={h:>10}")
Raising the File Descriptor Limit
import resource

def maximize_fd_limit() -> int:
    """
    Raise the FD soft limit to the hard limit.
    Call at server startup, before accepting connections.
    Returns the new soft limit.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"FD limits: soft={soft}, hard={hard}")
    # RLIM_INFINITY is -1 on some platforms, so compare against it explicitly
    # instead of relying on soft < hard
    unlimited = resource.RLIM_INFINITY
    if soft != unlimited and (hard == unlimited or soft < hard):
        target = max(soft, 65536) if hard == unlimited else hard
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
            soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
            print(f"Raised FD soft limit to {soft}")
        except (ValueError, OSError) as e:
            print(f"Could not raise limit: {e}")
    return soft

max_fds = maximize_fd_limit()
Measuring Memory Usage
import resource
import os

def memory_stats() -> dict:
    usage = resource.getrusage(resource.RUSAGE_SELF)
    # ru_maxrss: kilobytes on Linux, bytes on macOS
    multiplier = 1 if os.uname().sysname == "Darwin" else 1024
    return {
        "peak_rss_bytes": usage.ru_maxrss * multiplier,
        "user_cpu_seconds": usage.ru_utime,
        "system_cpu_seconds": usage.ru_stime,
        "minor_faults": usage.ru_minflt,  # no disk I/O required
        "major_faults": usage.ru_majflt,  # required disk I/O (slow)
        "voluntary_ctx_switches": usage.ru_nvcsw,
        "involuntary_ctx_switches": usage.ru_nivcsw,
    }

stats = memory_stats()
print(f"Peak RSS: {stats['peak_rss_bytes'] // (1024*1024)} MB")
print(f"User CPU: {stats['user_cpu_seconds']:.3f}s")
print(f"Major page faults: {stats['major_faults']}")
The /proc Filesystem
On Linux, /proc is a virtual filesystem that exposes kernel data structures as files. Nothing is stored on disk: the kernel generates the contents on demand, so reading /proc/self/status costs one open(2) plus a read(2) and has negligible overhead.
import os

def parse_proc_status(pid: str = "self") -> dict:
    """Parse /proc/<pid>/status into a dictionary."""
    result = {}
    try:
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if ":" in line:
                    key, _, value = line.partition(":")
                    result[key.strip()] = value.strip()
    except FileNotFoundError:
        pass
    return result

status = parse_proc_status()
print(f"VM RSS: {status.get('VmRSS', 'N/A')}")    # resident set size
print(f"VM Peak: {status.get('VmPeak', 'N/A')}")  # peak virtual memory
print(f"Threads: {status.get('Threads', 'N/A')}")
print(f"FD slots: {status.get('FDSize', 'N/A')}")
print(f"Voluntary ctx switches: {status.get('voluntary_ctxt_switches', 'N/A')}")

def list_open_fds(pid: str = "self") -> dict:
    """List all open file descriptors via /proc/<pid>/fd."""
    fd_dir = f"/proc/{pid}/fd"
    fds = {}
    try:
        for name in os.listdir(fd_dir):
            link = os.path.join(fd_dir, name)
            try:
                target = os.readlink(link)
                fds[int(name)] = target
            except (PermissionError, FileNotFoundError, ValueError):
                pass
    except PermissionError:
        pass
    return fds

open_fds = list_open_fds()
print(f"\nOpen FDs: {len(open_fds)}")
for fd, target in sorted(open_fds.items())[:12]:
    print(f" fd {fd:3d} -> {target}")

def parse_memory_maps(pid: str = "self") -> list:
    """Parse /proc/<pid>/maps to show all virtual memory regions."""
    maps = []
    try:
        with open(f"/proc/{pid}/maps") as f:
            for line in f:
                parts = line.split()
                if len(parts) >= 5:
                    maps.append({
                        "range": parts[0],
                        "perms": parts[1],
                        "offset": parts[2],
                        "name": parts[5] if len(parts) > 5 else "",
                    })
    except FileNotFoundError:
        pass
    return maps

maps = parse_memory_maps()
print(f"\nMemory regions: {len(maps)}")
for m in maps[:10]:
    print(f" {m['range']} {m['perms']} {m['name']}")
CPU Affinity and Scheduling Priority
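For CPU-bound workers, pinning each process to a specific core can improve cache locality and reduce the cost of being migrated between cores. Measure before and after: on a lightly loaded machine the scheduler often does well enough on its own.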
import os

# Get the set of CPUs this process is allowed to run on (Linux only)
affinity = os.sched_getaffinity(0)  # 0 = current process
print(f"Allowed CPUs: {sorted(affinity)}")
# Pin to a single CPU (CPU 0 may not be in the allowed set inside a container)
os.sched_setaffinity(0, {min(affinity)})
print(f"Pinned to: {sorted(os.sched_getaffinity(0))}")
# Restore the original set rather than assuming every CPU is permitted
os.sched_setaffinity(0, affinity)
# Nice value: -20 (highest priority) to 19 (lowest)
# Raising priority (lowering nice) requires CAP_SYS_NICE (root)
current_nice = os.nice(0)  # os.nice(n) increments by n and returns new value
print(f"Nice: {current_nice}")
os.nice(10)  # deprioritize background worker
print(f"Nice after deprioritize: {os.nice(0)}")

# Pattern: pin each multiprocessing worker to its own CPU core
def pin_worker(worker_index: int) -> None:
    """Call at the start of each worker process."""
    # Choose from the CPUs we are actually allowed to use; os.cpu_count()
    # reports every CPU on the machine, including ones a cgroup may forbid
    allowed = sorted(os.sched_getaffinity(0))
    cpu_id = allowed[worker_index % len(allowed)]
    os.sched_setaffinity(0, {cpu_id})
    print(
        f"Worker {worker_index} (PID {os.getpid()}) pinned to CPU {cpu_id}",
        flush=True,
    )
Interview Q&A
Q1: What is the difference between os._exit() and sys.exit(), and when must you use os._exit()?
sys.exit() raises SystemExit, which triggers Python's normal teardown: atexit handlers run, __del__ methods are called on objects, and all io.BufferedWriter instances flush their buffers to their underlying file descriptors. os._exit() calls the _exit(2) syscall directly - the kernel immediately terminates the process without any Python-level cleanup.
You must use os._exit() in any code path that runs in a child process created by os.fork(). The child inherits all of the parent's open file descriptors and a private copy of its Python heap. If the child calls sys.exit(), it will: (1) flush its copy of pending output buffers, writing duplicate data into log files or pipes the parent is also writing; (2) run atexit handlers that may send shutdown messages over database sockets the parent still uses; (3) invoke __del__ on its copies of the parent's objects, repeating external side effects such as deleting temporary files. It cannot corrupt the parent's memory or reference counts, because the two address spaces are separate. The one exception: if the forked child immediately calls os.exec*(), the exec replaces the process image, making Python-level cleanup moot. Even then, it is cleaner and safer to always use os._exit() in the error path of a forked child.
Q2: Explain copy-on-write semantics after fork() and why Python's reference counting can make it problematic.
After fork(), the kernel marks all private memory pages as shared between parent and child, with copy-on-write semantics. Physical RAM is shared; the pages are mapped read-only in both address spaces. The first write to any page by either process triggers a hardware page fault; the kernel services the fault by copying the page (typically 4 KB), mapping the copy into the writing process's address space, and resuming execution. Reads never trigger a copy.
The problem for CPython: reference counting writes ob_refcnt whenever a new reference to an object is taken. Even reading a list element increments that element's refcount, which is stored in the object's header on the same memory page as the object itself. In a Python process with a 2 GB heap that forks 16 workers, each worker touching even a small fraction of the shared heap will trigger COW faults on those pages. Over time, if workers access most of the heap (e.g., a large in-memory dictionary), the physical memory usage approaches 2 GB × 16 = 32 GB - even though conceptually all workers are sharing the same read-only data. The mitigations are: (1) pre-load data before forking and minimize how much of it each child touches, (2) use multiprocessing.shared_memory.SharedMemory backed by POSIX shm, which bypasses Python's allocator and refcounting entirely, or (3) keep bulk data in a few large buffers (bytes, array, NumPy arrays) rather than millions of small Python objects, so that there is one refcount per buffer instead of one per element.
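As a rough illustration of the shared-memory mitigation, the sketch below creates a small SharedMemory segment, writes into it, and attaches to it a second time by name, which is what a worker process would do. The segment size and payload are arbitrary, and both handles live in one process here only to keep the example self-contained.

```python
from multiprocessing import shared_memory

# Owner: create a named segment and fill it with raw bytes
segment = shared_memory.SharedMemory(create=True, size=16)
try:
    segment.buf[:5] = b"hello"

    # Worker side (same process here for brevity): attach by name
    view = shared_memory.SharedMemory(name=segment.name)
    try:
        # The bytes live outside the Python heap, so reading them does not
        # touch any per-object refcount in memory shared with the parent
        seen = bytes(view.buf[:5])
    finally:
        view.close()
finally:
    segment.close()
    segment.unlink()   # the owner removes the segment when done

print(seen)
```

The data in the segment is untyped bytes, so structured data has to be packed explicitly (for example with struct or a NumPy array built on segment.buf).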
Q3: What constraints apply to Python signal handlers, and how does loop.add_signal_handler() differ from signal.signal()?
At the C level, signal handlers must be async-signal-safe - they can only call functions from the POSIX async-signal-safe list (write, _exit, signal, etc.). Python's signal.signal() works around this: the actual C handler just sets a _Py_atomic_int flag and returns. CPython's eval loop checks this flag between bytecodes and dispatches the Python handler in the main thread. This means Python signal handlers are always invoked in the main thread, between two bytecodes - not mid-malloc, not in the middle of a C-level lock acquisition. However, the handler can still interrupt Python code that is halfway through an operation, so you should avoid: logging calls (the interrupted code may be in the middle of a log write), socket I/O, and large memory allocation (may trigger GC). Setting a flag or writing a byte to a pre-opened non-blocking pipe is the safest choice; threading.Event.set() is generally fine as well, as long as the main thread does not manipulate the same Event concurrently.
loop.add_signal_handler() is asyncio's signal integration. It registers a callback with the event loop rather than a bare Python-level handler. Internally, asyncio maintains a self-pipe: the C signal handler writes a byte to the write end; the event loop polls the read end as a regular I/O event. When the byte arrives, the loop calls the registered Python callback in the event loop's context - meaning the callback can safely set asyncio.Event objects and schedule coroutines with loop.create_task(). It is a plain callable, so it cannot itself await. With signal.signal() in an asyncio program, the signal might arrive while the loop is blocked in epoll_wait(), and the subsequent Python handler call happens outside the event loop's normal callback queue - making it unsafe to interact with asyncio objects.
Q4: How do zombie processes form, and what are the consequences of not reaping them at scale?
When a child process exits, it moves to the zombie state (Z) - the kernel preserves its entry in the process table, containing the PID, exit status, and CPU accounting data, until the parent collects it with wait()/waitpid(). The child's memory, file descriptors, and address space are fully released; only the lightweight process table entry remains. If the parent never calls wait(), the zombie entry persists until the parent itself exits, at which point init (PID 1) inherits and reaps the orphan.
At scale, the consequence is PID exhaustion. Linux has a historical default pid_max of 32768 (/proc/sys/kernel/pid_max, configurable up to ~4 million on 64-bit). A server that forks a new child per request (the fork-per-request model) and never calls waitpid() will accumulate zombie entries at a rate equal to the request rate. When the PID table is full, fork() fails with EAGAIN - no new processes can be created. The system appears to hang. The fix: register a SIGCHLD handler that calls os.waitpid(-1, os.WNOHANG) in a loop to drain all available zombie entries, or on Linux set signal.signal(signal.SIGCHLD, signal.SIG_IGN) to have the kernel auto-reap children.
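The zombie state can be observed directly. The sketch below assumes Linux for the /proc lookup (elsewhere the state is reported as None) and uses os.waitid() with WNOWAIT to wait for the child to exit without reaping it; observe_zombie is an illustrative name.

```python
import os

def observe_zombie():
    """Fork a child, look at its state after exit but before reaping."""
    pid = os.fork()
    if pid == 0:
        os._exit(7)                      # child: exit immediately with code 7

    # Block until the child has exited, but leave it unreaped (WNOWAIT)
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)

    state = None
    stat_path = f"/proc/{pid}/stat"
    if os.path.exists(stat_path):
        with open(stat_path) as f:
            # Format: "pid (comm) state ..."; split after the closing parenthesis
            state = f.read().rsplit(")", 1)[1].split()[0]

    # Now reap the child: this removes the process table entry
    _, status = os.waitpid(pid, 0)
    return state, os.WEXITSTATUS(status)

state, exit_code = observe_zombie()
print(f"state before reaping: {state}, exit code: {exit_code}")
```

On Linux the state field is expected to read Z until waitpid() collects the exit status, after which the /proc entry disappears.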
Q5: What does RLIMIT_NOFILE control and how do you safely raise it in a production server?
RLIMIT_NOFILE is the per-process limit on simultaneously open file descriptors. Every TCP connection (client socket), listening socket, open file, pipe end, epoll instance, and inotify instance consumes an FD. (A memory mapping does not hold one once the file has been closed, and individual inotify watches are counted against a separate limit.) The default soft limit on most Linux distributions is 1024; the hard limit varies by distribution and init system, with values such as 4096, 65536, or 524288 being common. When a server reaches its FD limit, accept(), open(), and socket() calls fail with EMFILE. The failure manifests as HTTP 503 errors, database connection failures, or silent request drops - with the process appearing healthy.
The safe production pattern: at server startup (before binding the listen socket), call resource.getrlimit(resource.RLIMIT_NOFILE) to get soft and hard limits, then resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard)) to raise the soft limit to the hard limit. You can only raise the soft limit up to the hard limit without root privileges. In Docker, the limits are set with the --ulimit flag (or ulimits in a Compose file); Kubernetes has no per-pod ulimit setting, so containers inherit whatever the container runtime on the node is configured with. In systemd services, set LimitNOFILE=65535 in the unit file. Additionally: always close FDs explicitly when done, use close_fds=True in subprocess.Popen, and audit for leaks with lsof -p PID | wc -l or by counting entries in /proc/self/fd.
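To see the EMFILE failure mode without exhausting a real server, the soft limit can be lowered temporarily. This is a sketch with an arbitrary limit of 64 and a hypothetical helper name (hit_fd_limit); it restores the original limits before returning.

```python
import errno
import os
import resource

def hit_fd_limit(temp_soft: int = 64) -> int:
    """Lower the soft FD limit, open files until it is hit, return the errno."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Lowering the soft limit never needs privileges; never raise it here
    if soft == resource.RLIM_INFINITY:
        new_soft = temp_soft
    else:
        new_soft = min(soft, temp_soft)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))

    opened = []
    try:
        while True:
            opened.append(os.open(os.devnull, os.O_RDONLY))
    except OSError as exc:
        err = exc.errno               # the loop only ends when open() fails
    finally:
        for fd in opened:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))
    return err

err = hit_fd_limit()
print(f"open() failed with {errno.errorcode[err]}")
```

The same errno surfaces from accept() and socket() in a server, which is why catching OSError around accept loops and logging exc.errno makes FD exhaustion much easier to diagnose.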
